Abstract

In  an  inverted  file  document  retrieval  system,  a  query  is  in  the 
form  of  a  Boolean  expression  of  index  terms.   In  response  to  a  query, 
the  system  accesses  the  inverted  lists  corresponding  to  the  index  terms, 
merges  them  and  selects  from  the  merged  list  those  documents  that  satisfy 
the  search  logic.   In  this  paper,  we  consider  the  problem  of  determining 
a  Boolean  expression  which  leads  to  the  minimum  total  merge  time  among 
all  Boolean  expressions  that  are  equivalent  to  the  expression  given  in 
the  query.   This  problem  is  the  same  as  finding  an  optimal  merge  tree 
among  all  trees  that  realize  the  truth  function  determined  by  the  Boolean 
expression  in  the  query.   Several  algorithms  which  generate  optimal  merge 
trees,  when  the  lengths  of  overlaps  between  different  lists  are  small 
compared  with  the  length  of  the  lists,  are  described.  These  algorithms 
are  no  longer  optimal  when  the  lengths  of  overlaps  cannot  be  neglected. 
In  this  case,  it  is  possible  to  bound  the  performance  of  these  algorithms 
in  some  instances  in  terms  of  the  maximum  overlap  between  lists.   The 
performance  bounds  are  discussed. 


I.   Introduction

In  this  paper,  we  consider  the  problem  of  parsing  search  queries 
in  inverted  file  document  retrieval  systems.   In  such  systems,  the  index 
file  contains  an  entry  for  each  of  the  index  terms  selected  as  descriptors 
for  the  documents  in  the  data  file.   Each  entry  in  the  index  file  contains  a 
pointer  to  an  inverted  list  of  pointers  in  the  postings  file.   The  pointers 
in  this  list  in  turn  point  to  all  the  documents  in  the  data  file  that 
contain the corresponding index term. This file organization has been
studied extensively and is used in many well known systems [1-4].

A query to an inverted file document retrieval system is in the
form of a Boolean expression of index terms. For example, to request
information on scheduling or resource management policies in time-shared
systems, a user may present to the system the query

"Time Shared" · ("Scheduling Policy" + "Resource Management Policy")    (1-1)

where · and + are the AND and OR operators, respectively. In response to a
request, the system accesses the inverted lists in the postings file
corresponding to the index terms in the query, merges them, and selects from
the merged list those documents that satisfy the search logic. In our
example, the union of the lists corresponding to the index terms
"Scheduling Policy" and "Resource Management Policy" is the list of
pointers to the documents on scheduling or resource management policies.
Let us denote this list by A. The list A is obtained by merging the two
lists with duplicated entries deleted. We call the process of merging two
lists to obtain their union an OR merge. The intersection of two lists is
obtained by merging them and deleting from the merged list all entries except
the duplicated ones. We call the process of merging two lists to obtain their
intersection an AND merge. In this example, pointers to the documents to be
selected in response to the query in (1-1) are obtained by AND merging the
list A with the list corresponding to the index term "Time Shared." In
other words, these three lists are merged in the order specified by the tree
in Fig. 1-1a. We label the leaves of the tree by the index terms and the
internal nodes by the Boolean operators corresponding to the merges.
We note that the query in (1-1) is equivalent to the query

"Scheduling Policy" · "Time Shared" + "Resource Management Policy" · "Time Shared"

To answer this query, the corresponding lists are merged in the order specified
by the tree in Fig. 1-1b. Clearly, the times required to produce the response
might be different for the two equivalent queries written in different
Boolean forms.

We  are  concerned  here  with  the  problem  of  determining  a  Boolean 
expression  which  leads  to  the  minimum  total  merge  time  among  all  Boolean 
expressions  that  are  equivalent  to  the  expression  given  in  the  query.   For  this 
purpose,  we  describe  the  inverted  file  document  retrieval  system  schematically 
as shown in Fig. 1-2.* To process a query, the lists corresponding to the index
terms in the query are read into the buffer memory and are merged after they have
been placed in the buffer. The total retrieval time is equal to the sum of the
time required to access the lists from the secondary memory and the merge
time of the lists. In this paper, no attempt is made to minimize the
former. (The average access time of the lists from the secondary memory
has been estimated elsewhere [5,6]. The dependence of list access time
on the access algorithm and buffer management scheme used in the system
is the subject of a separate study [7].) Furthermore, only the case of
two-way merges is considered here.

*The total merge time computed here is equal to the amount of time required
of the merge processor to process the query. The merge order that minimizes
this time does not minimize the total retrieval time in general. However,
keeping the total merge time minimized becomes important in multiuser systems
in which many users share a large buffer memory. While the lists corresponding
to the index terms in the query of a user are being merged, the lists specified
in the queries of other users can be loaded into the buffer.

In  Section  II,  we  introduce  the  terminologies  and  notations 
necessary  in  our  discussion.   In  Section  III,  we  assume  that  the  lengths 
of  overlaps  between  different  lists  are  very  small  compared  with  the  lengths 
of  the  lists.   Hence,  in  computation  of  the  total  merge  time,  lengths  of 
overlaps  between  lists  can  be  neglected.   Several  algorithms  are  described 
in  Section  III.   These  algorithms  allow  us  to  find  optimal  Boolean 
expressions  when  queries  are  written  as  nested  Boolean  expressions  in  which 
(i) all variables are distinct, and (ii) the complement of any variable, $\bar{B}$,
can appear only in product terms, such as $A_1 \cdot A_2 \cdots A_k \cdot \bar{B}$, with at
least one of the variables $A_i$ being uncomplemented. These algorithms are
no  longer  optimal  when  the  lengths  of  overlap  between  lists  cannot  be 
neglected.   In  this  case,  it  is  possible  to  bound  the  performance  of  these 
algorithms  in  some  instances  in  terms  of  the  maximum  overlap  between  lists. 
We  discuss  these  bounds  in  Section  IV. 


II.   Notations 

Throughout our discussion, we use upper case letters (e.g., A, B, C,
D) to denote both index terms and their corresponding inverted lists. Lower
case letters are used to denote the lengths of the corresponding lists
(e.g., a, b, c and d denote the lengths of the lists A, B, C, and D,
respectively). Let $\sigma(A,B)$ denote the length of the overlap between lists A
and B. The lengths of the resultant lists obtained by AND merging and OR
merging the lists A and B are $\sigma(A,B)$ and $a + b - \sigma(A,B)$, respectively. Let $\bar{A}$
denote the complement of the index term A. The list corresponding to $\bar{A} \cdot B$ is
obtained by merging the lists A and B and selecting from the merged list
those entries that are in list B but not in list A. We call this merging
process an AND NOT merge. Clearly, the length of the list obtained by
AND NOT merging A and B is equal to $b - \sigma(A,B)$. We shall not be concerned
with Boolean expressions containing terms of the forms $\bar{A} + B$ and $\bar{A} \cdot \bar{B}$. This
restriction leads to no loss of generality since Boolean expressions of
these forms are explicitly ruled out in most installations.
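The three merge types defined above can be sketched as a single two-way merge over sorted pointer lists. This is only an illustration under assumptions not stated in the paper: the lists are taken to be sorted and free of duplicates, and the function names (`merge`, `or_merge`, `and_merge`, `and_not_merge`) are hypothetical.

```python
def merge(a, b, keep):
    """Two-way merge of sorted, duplicate-free pointer lists a and b.

    keep(in_a, in_b) decides whether an entry goes to the output.
    Every entry of both lists is examined once, so the work is
    proportional to len(a) + len(b) for all three merge types.
    """
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            x, in_a, in_b = a[i], True, False
            i += 1
        elif i == len(a) or b[j] < a[i]:
            x, in_a, in_b = b[j], False, True
            j += 1
        else:  # the same pointer occurs in both lists (part of the overlap)
            x, in_a, in_b = a[i], True, True
            i += 1
            j += 1
        if keep(in_a, in_b):
            out.append(x)
    return out


def or_merge(a, b):
    """Union of A and B: length a + b - sigma(A, B)."""
    return merge(a, b, lambda in_a, in_b: True)


def and_merge(a, b):
    """Intersection of A and B: length sigma(A, B)."""
    return merge(a, b, lambda in_a, in_b: in_a and in_b)


def and_not_merge(a, b):
    """Entries in B but not in A: length b - sigma(A, B)."""
    return merge(a, b, lambda in_a, in_b: in_b and not in_a)
```

With A = [1, 3, 5, 7] and B = [3, 4, 7, 9], the overlap has length 2, and the three result lengths are 6, 2 and 2, in agreement with the expressions given in the text.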

Consider  a  query  written  as  a  Boolean  expression  of  the  index  terms 


$A_1, A_2, \ldots, A_n$. Let $Q(A_1, A_2, \ldots, A_n)$ denote the truth function determined by
this expression.* (A truth function is usually represented by a truth table
for the Boolean expression.) We note that the truth function $Q(A_1, A_2, \ldots, A_n)$
determines a unique list of pointers. This list is the valid response
to all queries written as Boolean expressions that determine this truth
function, and this list can be obtained by merging the lists $A_1, A_2, \ldots, A_n$.
Let $T(F(A_1, A_2, \ldots, A_n))$ denote a tree specifying the merge order of the lists
$A_1, A_2, \ldots, A_n$ corresponding to the Boolean expression $F(A_1, A_2, \ldots, A_n)$. Again,
the leaves of the tree are labeled with the names of the corresponding lists
while the internal nodes are marked by the Boolean operators corresponding
to the merges. (Examples of such trees are shown in Fig. 1-1.) We say that
the tree $T(F(A_1, A_2, \ldots, A_n))$ realizes the truth function $Q(A_1, A_2, \ldots, A_n)$ if
$Q(A_1, A_2, \ldots, A_n)$ is the truth function determined by $F(A_1, A_2, \ldots, A_n)$. Since
there are many ways to parenthesize a Boolean expression (e.g., $A \cdot (B + (C + D))$
and $A \cdot ((B + C) + D)$ are two different ways to parenthesize the expression
$A \cdot (B + C + D)$), there are many different binary merge trees corresponding
to a Boolean expression. We distinguish them by using different subscripts.
(For example, $T_1(A \cdot (B + C + D))$ and $T_2(A \cdot (B + C + D))$ denote two different binary
merge trees for the expression $A \cdot (B + C + D)$.) When there is no possible
confusion, we also refer to the tree $T(F(A_1, A_2, \ldots, A_n))$ simply as T.

*More precisely, a Boolean expression of the index terms $A_1, A_2, \ldots, A_n$
specifies an element in the free Boolean algebra with n generators. This
element in the free Boolean algebra in turn specifies the truth function
$Q(A_1, A_2, \ldots, A_n)$.

The time required to merge two lists is proportional to the sum of
their lengths for all three types of merges. (To be specific, let the
constant of proportionality be 1.) The cost of the tree $T(F(A_1, A_2, \ldots, A_n))$, denoted
$C(T)$, is equal to the total merge time of the lists $A_1, A_2, \ldots, A_n$ when the
order of the merges is specified by T. A tree is said to be optimal when
its cost is minimum among all trees which realize the truth function
$Q(A_1, A_2, \ldots, A_n)$ determined by the Boolean expression $F(A_1, A_2, \ldots, A_n)$ in the
query. With a slight abuse of notation, we denote the optimal tree by
$T_0(F(A_1, A_2, \ldots, A_n))$. Hence the problem of finding a Boolean expression
corresponding to the minimum total merge time is the same as that of finding
an optimal merge tree among all trees that realize the truth function
$Q(A_1, A_2, \ldots, A_n)$.
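The cost definition can be illustrated with a small evaluator for binary merge trees. This is a sketch only: the tuple encoding of trees, the use of Python sets in place of inverted lists, and the sample document numbers are assumptions made for the example. Each merge is charged the sum of the lengths of its two input lists, with the constant of proportionality taken as 1.

```python
def evaluate(tree, lists):
    """Return (result, cost) for a binary merge tree.

    tree  : a leaf name (str) or a tuple (op, left, right) with
            op in {"OR", "AND", "ANDNOT"}.
    lists : dict mapping leaf names to sets of document pointers.
    """
    if isinstance(tree, str):
        return lists[tree], 0          # reading a leaf costs no merge time
    op, left, right = tree
    l_set, l_cost = evaluate(left, lists)
    r_set, r_cost = evaluate(right, lists)
    if op == "OR":
        result = l_set | r_set
    elif op == "AND":
        result = l_set & r_set
    elif op == "ANDNOT":               # entries of the right list not in the left
        result = r_set - l_set
    else:
        raise ValueError(f"unknown operator: {op}")
    # one two-way merge: time proportional to the sum of the input lengths
    return result, l_cost + r_cost + len(l_set) + len(r_set)


# Hypothetical postings for the query in (1-1) and its distributed form.
lists = {"TS": {1, 2, 3, 4, 5, 6}, "SP": {1, 2, 7}, "RM": {3, 8}}
tree_a = ("AND", "TS", ("OR", "SP", "RM"))
tree_b = ("OR", ("AND", "SP", "TS"), ("AND", "RM", "TS"))
```

Both trees realize the same truth function and return the same documents, but for these sample lists `tree_a` costs 16 and `tree_b` costs 20, which is the kind of difference the optimal merge tree problem is about.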


III.  Algorithms  for  Determining  Optimal 
Merge  Trees 


In  this  section,  we  assume  that  the  lengths  of  overlap  between 
different  lists  are  very  small  compared  with  the  lengths  of  the  lists. 
Hence,  in  computation  of  total  merge  time,  lengths  of  overlaps  can  be 
neglected.   In  this  case,  the  length  of  the  resultant  list  obtained  by  OR 
merging  list  A  and  B  is  equal  to  a+b.   The  length  of  the  resultant  list 
obtained by AND merging A and B is very small compared to a or b. In the
computation of merge time, it is assumed to be zero. The length of the
list obtained by AND NOT merging A and B is approximately equal to b.

The lengths of overlaps between lists being negligibly small, an
optimal tree for OR merging the lists $A_1, A_2, \ldots, A_n$ to realize the truth
function determined by the Boolean expression $A_1 + A_2 + \cdots + A_n$ is one with
minimum weighted path length, where the weights of the leaves are the lengths
of the lists and the weights of internal nodes are zero. Such a tree can
be constructed using Huffman's procedure. We call the resultant tree a
Huffman tree for $A_1, A_2, \ldots, A_n$ and denote it by $T_0(A_1 + A_2 + \cdots + A_n)$ [9].
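As a minimal sketch of Huffman's procedure applied to OR merges with negligible overlaps, the function below repeatedly merges the two shortest lists and accumulates the merge time. The function name is hypothetical, and only the cost is returned, not the tree itself.

```python
import heapq


def huffman_merge_cost(lengths):
    """Total OR-merge time of a Huffman tree over the given list lengths.

    Overlaps are neglected, so OR merging lists of lengths x and y
    takes time x + y and yields a list of length x + y.
    """
    heap = list(lengths)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        x = heapq.heappop(heap)   # shortest remaining list
        y = heapq.heappop(heap)   # second shortest
        cost += x + y             # time of this two-way merge
        heapq.heappush(heap, x + y)
    return cost
```

For list lengths 1, 2, 5 and 10 the successive merges cost 3, 8 and 18, so the total is 29.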

A property of an optimal tree is that all its subtrees are optimal.
It follows that if $T_{10}$ and $T_{20}$ are the two subtrees of an optimal tree
$T_0(A_1 + A_2 + \cdots + A_n)$ obtained by removing the root of $T_0$, then $C(T_{10}) + C(T_{20})$
is minimum among all possible pairs of subtrees of an arbitrary tree
$T(A_1 + A_2 + \cdots + A_n)$. Similarly, let $T_{10}, T_{20}, \ldots, T_{m0}$ denote m subtrees of
$T_0(A_1 + A_2 + \cdots + A_n)$ obtained by removing the roots of larger subtrees of $T_0$.
Then

$$C(T_{10}) + C(T_{20}) + \cdots + C(T_{m0}) \le C(T_1) + C(T_2) + \cdots + C(T_m) \tag{3-1}$$

where $T_1, T_2, \ldots, T_m$ are m subtrees of an arbitrary tree $T(A_1 + A_2 + \cdots + A_n)$. We
call the subtrees $T_{10}, T_{20}, \ldots, T_{m0}$ Huffman subtrees.


Optimal Merge Trees for Boolean Expressions of the
Form $(A_1 + A_2 + \cdots + A_n) \cdot B$

Let $A_1, A_2, \ldots, A_n$ and B be n+1 lists with lengths $a_1, a_2, \ldots, a_n$ and b,
respectively. To find an algorithm which yields an optimal form of the
Boolean expression $(A_1 + A_2 + \cdots + A_n) \cdot B$, we consider the tree shown in Fig. 3-1,
where $T_1, T_2, \ldots, T_m$ are m subtrees of $T(A_1 + A_2 + \cdots + A_n)$. We note that

Lemma 3-1. If the tree in Fig. 3-1 is an optimal merge tree among
all trees that realize the truth function determined by the expression
$(A_1 + A_2 + \cdots + A_n) \cdot B$, then $T_1, T_2, \ldots, T_m$ are Huffman subtrees of $T_0(A_1 + A_2 + \cdots + A_n)$.

Proof: The cost of the tree in Fig. 3-1 is

$$C(T) = C(T_1) + C(T_2) + \cdots + C(T_m) + \sum_{i=1}^{n} a_i + mb$$

Because of (3-1), this cost is minimized when $T_1, T_2, \ldots, T_m$ are Huffman
subtrees. ■

As a consequence of Lemma 3-1, to find an optimal tree which realizes
the truth function determined by the Boolean expression $(A_1 + A_2 + \cdots + A_n) \cdot B$,
we need to consider only trees corresponding to expressions of the form

$$\Bigl(\sum_{A_i \in S_1} A_i\Bigr) \cdot B + \Bigl(\sum_{A_i \in S_2} A_i\Bigr) \cdot B + \cdots + \Bigl(\sum_{A_i \in S_m} A_i\Bigr) \cdot B \tag{3-2}$$

where $S_j$ ($1 \le j \le m$) is the set of leaves of the $j$th Huffman subtree of
$T_0(A_1 + A_2 + \cdots + A_n)$. Furthermore, an optimal merge tree that realizes the
truth function determined by $(A_1 + A_2 + \cdots + A_n) \cdot B$ can be obtained from
Figure  3-1
Theorem 3-1

$(A_1 + A_2 + \cdots + A_n) \cdot B$ is an optimal Boolean form if and only if

$$\sum_{i=1}^{n} a_i \le b$$

Hence an optimal merge tree for $(A_1 + A_2 + \cdots + A_n) \cdot B$ is the tree, $T_u$, shown in Fig.
3-2a, where $T_0(A_1 + A_2 + \cdots + A_n)$ is the Huffman tree for the lists $A_1, A_2, \ldots, A_n$.

Proof: Let $T_{10}$ and $T_{20}$ be two Huffman subtrees of $T_0(A_1 + A_2 + \cdots + A_n)$
and S be a subset of $\{A_1, A_2, \ldots, A_n\}$. Because of Lemma 3-1, the tree $T_d$ in
Fig. 3-2b will have the minimum cost among all trees $T(F(B, A_1, A_2, \ldots, A_n))$
corresponding to Boolean expressions of the form

$$F(B, A_1, A_2, \ldots, A_n) = \Bigl(\sum_{A_i \in S} A_i\Bigr) \cdot B + \Bigl(\sum_{A_i \in \{A_1, A_2, \ldots, A_n\} - S} A_i\Bigr) \cdot B$$

The cost of $T_d$ is

$$C_d = C(T_{10}) + C(T_{20}) + \sum_{i=1}^{n} a_i + 2b = C(T_0) + 2b$$

while the cost of $T_u$ is

$$C_u = C(T_0) + \sum_{i=1}^{n} a_i + b$$

Therefore, we have

$$C_d \ge C_u$$

if and only if

$$\sum_{i=1}^{n} a_i \le b \tag{3-3}$$
Figure  3-2


We now show that the inequality in (3-3) implies that the tree $T_u$ in
Fig. 3-2a is indeed optimal. Again, because of Lemma 3-1, it suffices for
us to show that (3-3) implies that the cost of the tree $T_m$ shown in Fig. 3-3a,
$C_m$, is no greater than the cost of the tree $T_{(m+1)}$ shown in Fig. 3-3b, $C_{m+1}$, where
$T_{j10}$ and $T_{j20}$ are two Huffman subtrees of $T_{j0}$. We note that

$$C_{m+1} = \sum_{k=1,\, k \ne j}^{m} C(T_{k0}) + C(T_{j10}) + C(T_{j20}) + \sum_{i=1}^{n} a_i + (m+1)\,b$$

and

$$C_m = \sum_{k=1}^{m} C(T_{k0}) + \sum_{i=1}^{n} a_i + mb$$

But

$$C(T_{j0}) = C(T_{j10}) + C(T_{j20}) + \sum_{A_i \in S_j} a_i$$

where $S_j$ is the set of leaves of $T_{j0}$. That

$$C_{m+1} - C_m = b - \sum_{A_i \in S_j} a_i$$

is equal to or larger than zero is clearly implied by the inequality (3-3). ■

From Theorem 3-1, we have Algorithm A for finding an optimal merge
tree for $(A_1 + A_2 + \cdots + A_n) \cdot B$.

Algorithm A

1.   a.   If $b \ge \sum_{i=1}^{n} a_i$, $(A_1 + A_2 + \cdots + A_n) \cdot B$ is an optimal Boolean form and
an optimal merge tree is $T_u$ shown in Fig. 3-2a, where $T_0(A_1 + A_2 + \cdots + A_n)$
is the Huffman tree for the lists $A_1, A_2, \ldots, A_n$.

     b.   If $b < \sum_{i=1}^{n} a_i$, choose the merge tree $T_d$ shown in Fig. 3-2b, where
$T_{10}$ and $T_{20}$ are two Huffman subtrees of $T_0(A_1 + A_2 + \cdots + A_n)$. The
corresponding Boolean expression is

$$\Bigl(\sum_{A_i \in S_1} A_i\Bigr) \cdot B + \Bigl(\sum_{A_i \in S_2} A_i\Bigr) \cdot B$$

where $S_1$ and $S_2$ are the sets of leaves of $T_{10}$ and $T_{20}$.

2.   An optimal tree can be obtained by repeating step 1 for each of the
terms $(\sum A_i) \cdot B$.

Figure 3-3. Merge trees (a) $T_m$ and (b) $T_{(m+1)}$.

Consider the example shown in Fig. 3-4. The lengths of the lists
B, $A_1$, $A_2$, $A_3$ and $A_4$ are 5, 1, 2, 5 and 10, respectively. The Huffman tree
for $A_1 + A_2 + A_3 + A_4$ is shown in Fig. 3-4a together with its two subtrees $T_{10}$ and
$T_{20}$. Since $a_1 + a_2 + a_3 + a_4 = 18 > b$, the cost of the tree $T_1$ in Fig. 3-4b is
less than the cost of any tree $T(B \cdot (A_1 + A_2 + A_3 + A_4))$. ($C(T_1) = 39$ and
$C(T(B \cdot (A_1 + A_2 + A_3 + A_4))) \ge 52$.) Hence, we choose the Boolean expression
corresponding to the tree $T_1$,

$$B \cdot (A_1 + A_2 + A_3) + B \cdot A_4$$

instead of $B \cdot (A_1 + A_2 + A_3 + A_4)$. Repeating step 1 for the term $B \cdot (A_1 + A_2 + A_3)$,
for which $a_1 + a_2 + a_3 = 8 > b$, we choose to distribute the · operation with
respect to the + operation and obtain

$$B \cdot (A_1 + A_2) + B \cdot A_3 + B \cdot A_4$$

The corresponding merge tree is $T_1'$ shown in Fig. 3-4c. Since $a_1 + a_2 < b$,
we conclude that $T_1'$ is an optimal merge tree among all trees that realize
the truth function determined by the expression $B \cdot (A_1 + A_2 + A_3 + A_4)$. Indeed,
$C(T_1') = 36$ and $C(T(B \cdot A_1 + B \cdot A_2 + B \cdot A_3 + B \cdot A_4)) = 38$.
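The example can be reproduced with a short sketch of Algorithm A. The data layout (dictionaries for tree nodes), the function names, and the tie-breaking rule in the Huffman construction are assumptions made for illustration. Overlaps are neglected, so the final OR merges of the AND results contribute no cost.

```python
import heapq
from itertools import count


def huffman_tree(lengths):
    """Build a Huffman tree; each node records its weight (total length),
    the cost of the OR merges below it, its leaf indices and its children."""
    tie = count()  # unique tie-breaker so heap entries never compare nodes
    heap = [(w, next(tie), {"weight": w, "cost": 0, "leaves": [i], "kids": None})
            for i, w in enumerate(lengths)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, n1 = heapq.heappop(heap)
        w2, _, n2 = heapq.heappop(heap)
        node = {"weight": w1 + w2,
                "cost": n1["cost"] + n2["cost"] + w1 + w2,
                "leaves": n1["leaves"] + n2["leaves"],
                "kids": (n1, n2)}
        heapq.heappush(heap, (w1 + w2, next(tie), node))
    return heap[0][2]


def algorithm_a(node, b):
    """Return (groups, cost) for (sum of the A_i under node) AND B."""
    if node["kids"] is None or node["weight"] <= b:
        # Step 1a: keep the sum intact, then AND merge it once with B.
        return [node["leaves"]], node["cost"] + node["weight"] + b
    # Step 1b: distribute over the two Huffman subtrees; step 2: recurse.
    groups, cost = [], 0
    for kid in node["kids"]:
        g, c = algorithm_a(kid, b)
        groups += g
        cost += c
    return groups, cost
```

For the lengths above (`huffman_tree([1, 2, 5, 10])`, b = 5) this sketch returns the groups `[[0, 1], [2], [3]]`, i.e. $B \cdot (A_1 + A_2) + B \cdot A_3 + B \cdot A_4$, with cost 36, matching the value of $C(T_1')$ quoted in the text.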
Figure 3-4. Merge trees for the Boolean expression $B \cdot (A_1 + A_2 + A_3 + A_4)$: (a) the Huffman tree $T_0(A_1 + A_2 + A_3 + A_4)$ with its subtrees $T_{10}$ and $T_{20}$; (b); (c).
Optimal Merge Trees for Boolean Expressions of the
Form $(A_1 + A_2 + \cdots + A_n) \cdot (B_1 + B_2 + \cdots + B_m)$

Lemma 3-1 and Theorem 3-1 can be generalized to the case when the
Boolean expression specified in the query is of the form

$$(A_1 + A_2 + \cdots + A_n) \cdot (B_1 + B_2 + \cdots + B_m)$$

To do so, let us consider the tree $T_k$ shown in Fig. 3-5, where $T_{A1}, T_{A2}, \ldots, T_{Aj}$
are subtrees of $T(A_1 + A_2 + \cdots + A_n)$ and $T_{B1}, T_{B2}, \ldots, T_{Bk}$ are subtrees of
$T(B_1 + B_2 + \cdots + B_m)$. We state without proof:

Lemma 3-2. If the merge tree $T_k$ is optimal for the Boolean expression
$(A_1 + A_2 + \cdots + A_n) \cdot (B_1 + B_2 + \cdots + B_m)$, then $T_{A1}, T_{A2}, \ldots, T_{Aj}$ are Huffman subtrees of
$T_0(A_1 + A_2 + \cdots + A_n)$ and $T_{B1}, T_{B2}, \ldots, T_{Bk}$ are Huffman subtrees of $T_0(B_1 + B_2 + \cdots + B_m)$.

In this case, we have

Theorem 3-2.

$(A_1 + A_2 + \cdots + A_n) \cdot (B_1 + B_2 + \cdots + B_m)$ is an optimal Boolean expression if
and only if

$$\sum_{i=1}^{n} a_i = \sum_{i=1}^{m} b_i$$

Hence an optimal merge tree for $(A_1 + A_2 + \cdots + A_n) \cdot (B_1 + B_2 + \cdots + B_m)$ is the
tree $T_u$ in Fig. 3-6a, where $T_A$ and $T_B$ are Huffman trees for $A_1, A_2, \ldots, A_n$ and
$B_1, B_2, \ldots, B_m$, respectively.

Proof: We compare the cost of the tree $T_u$ in Fig. 3-6a with that
of $T_{dA}$ and $T_{dB}$ in Figs. 3-6b and 3-6c. The Boolean expressions corresponding
to the trees $T_{dA}$ and $T_{dB}$ are
Figure 3-6. Merge trees (a) $T_u$, (b) $T_{dA}$ and (c) $T_{dB}$.
$$\Bigl(\sum_{A_i \in S_{A1}} A_i\Bigr) \cdot (B_1 + B_2 + \cdots + B_m) + \Bigl(\sum_{A_i \in S_{A2}} A_i\Bigr) \cdot (B_1 + B_2 + \cdots + B_m)$$

and

$$\Bigl(\sum_{B_i \in S_{B1}} B_i\Bigr) \cdot (A_1 + A_2 + \cdots + A_n) + \Bigl(\sum_{B_i \in S_{B2}} B_i\Bigr) \cdot (A_1 + A_2 + \cdots + A_n)$$

respectively, where $S_{A1}$ and $S_{A2}$ are the sets of leaves of $T_{A1}$ and $T_{A2}$,
respectively (and similarly for $S_{B1}$ and $S_{B2}$). Let $C_u$ and $C_{dA}$ be the costs
of the trees $T_u$ and $T_{dA}$, respectively.

$$C_u = C(T_A) + C(T_B) + \sum_{i=1}^{n} a_i + \sum_{i=1}^{m} b_i$$

and

$$C_{dA} = C(T_{A1}) + C(T_{A2}) + C(T_B) + 2 \sum_{i=1}^{m} b_i + \sum_{i=1}^{n} a_i$$

But

$$C(T_A) = C(T_{A1}) + C(T_{A2}) + \sum_{i=1}^{n} a_i$$

Hence

$$C_u - C_{dA} = \sum_{i=1}^{n} a_i - \sum_{i=1}^{m} b_i$$

which is less than or equal to zero if and only if

$$\sum_{i=1}^{n} a_i \le \sum_{i=1}^{m} b_i$$

Similarly,

$$\sum_{i=1}^{m} b_i \le \sum_{i=1}^{n} a_i$$

implies

$$C_u - C_{dB} \le 0$$

where $C_{dB}$ is the cost of the tree $T_{dB}$. In other words,

$$C_u = C_{dB} = C_{dA}$$

when

$$\sum_{i=1}^{n} a_i = \sum_{i=1}^{m} b_i \tag{3-4}$$

We need to show that because of Eq. (3-4), $C_u$ is no greater than the cost of any
other tree which realizes the truth function determined by the Boolean
expression $(A_1 + A_2 + \cdots + A_n) \cdot (B_1 + \cdots + B_m)$. Let $T_k$ shown in Fig. 3-5 be such a
tree. Again, because of Lemma 3-2, $T_{A1}, T_{A2}, \ldots, T_{Aj}$ are j Huffman subtrees
of $T_A$ and $T_{B1}, T_{B2}, \ldots, T_{Bk}$ are k Huffman subtrees of $T_B$. Let $S_{Ap}$ be the
set of leaves of $T_{Ap}$ ($p = 1, 2, \ldots, j$) and $S_{Bq}$ be the set of leaves of
$T_{Bq}$ ($q = 1, 2, \ldots, k$). We note that the Boolean expression corresponding
to the tree $T_k$ is

$$\sum_{p=1}^{j} \sum_{q=1}^{k} \Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{Bq}} B_i\Bigr) \tag{3-5}$$

Let $T_{Bq1}$ and $T_{Bq2}$ be two Huffman subtrees of $T_{Bq}$ with $S_{Bq1}$ and
$S_{Bq2}$ being their sets of leaves, respectively. Let $T_{k+1}(1; r)$ denote a tree
corresponding to the Boolean expression

$$\sum_{p=1}^{j} \sum_{q=2}^{k} \Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{Bq}} B_i\Bigr) + \sum_{p=r+1}^{j} \Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{B1}} B_i\Bigr)$$

$$+ \sum_{p=1}^{r} \Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{B11}} B_i\Bigr) + \sum_{p=1}^{r} \Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{B12}} B_i\Bigr)$$

obtained by further distributing the · operation with respect to the + operation
in the sum term $\sum_{B_i \in S_{B1}} B_i$ in (3-5). We note that for any $1 \le r < j$,

$$C(T_{k+1}(1; r)) > C(T_k)$$

This inequality follows from

$$C(T_k) = \sum_{p=1}^{j} C(T_{Ap}) + \sum_{q=1}^{k} C(T_{Bq}) + k \sum_{i=1}^{n} a_i + j \sum_{i=1}^{m} b_i$$

while

$$C(T_{k+1}(1; r)) = \sum_{p=1}^{j} C(T_{Ap}) + \sum_{q=1}^{k} C(T_{Bq}) + k \sum_{i=1}^{n} a_i + j \sum_{i=1}^{m} b_i + \sum_{p=1}^{r} \sum_{A_i \in S_{Ap}} a_i$$

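When r = j, we have

$$C(T_{k+1}(1; j)) = \sum_{p=1}^{j} C(T_{Ap}) + \sum_{q=2}^{k} C(T_{Bq}) + C(T_{B11}) + C(T_{B12}) + (k+1) \sum_{i=1}^{n} a_i + j \sum_{i=1}^{m} b_i$$

But

$$C(T_{B1}) = C(T_{B11}) + C(T_{B12}) + \sum_{B_i \in S_{B1}} b_i$$

Hence

$$C(T_{k+1}(1; j)) - C(T_k) = \sum_{i=1}^{n} a_i - \sum_{B_i \in S_{B1}} b_i \tag{3-6}$$

which is larger than or equal to zero when Eq. (3-4) is valid. In general,
let $T_{k+t}(t; j)$ be the tree corresponding to the Boolean expression

$$\sum_{p=1}^{j} \sum_{q=t+1}^{k} \Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{Bq}} B_i\Bigr) + \sum_{p=1}^{j} \sum_{q=1}^{t} \Bigl[\Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{Bq1}} B_i\Bigr) + \Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{Bq2}} B_i\Bigr)\Bigr]$$

for $1 \le t \le k$, and let its cost be $C(T_{k+t}(t; j))$. We have

$$C(T_{k+t}(t; j)) - C(T_k) = t \sum_{i=1}^{n} a_i - \sum_{q=1}^{t} \sum_{B_i \in S_{Bq}} b_i$$

which is equal to or larger than zero when Eq. (3-4) is valid. ■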

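When $\sum_{i=1}^{n} a_i < \sum_{j=1}^{m} b_j$, the tree $T_u$ is no longer optimal. In this case,
let $S_{Bj}$, $1 \le j \le k$, be the sets of leaves of k Huffman subtrees of $T_B$ such
that their union is $\{B_1, B_2, \ldots, B_m\}$. Moreover,

$$\sum_{B_i \in S_{Bj}} b_i \le \sum_{i=1}^{n} a_i, \qquad j = 1, 2, \ldots, k$$

but

$$\sum_{B_i \in S_{Bj} \cup S_{Bj'}} b_i > \sum_{i=1}^{n} a_i, \qquad j, j' = 1, 2, \ldots, k \text{ and } j \ne j'.$$

It follows from Eq. (3-6) that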
It   follows  from  Eq.    (3-6)  that 
Corollary 3-1

When $\sum_{i=1}^{n} a_i < \sum_{j=1}^{m} b_j$, an optimal tree corresponding to the Boolean
expression

$$\sum_{j=1}^{k} (A_1 + A_2 + \cdots + A_n) \cdot \Bigl(\sum_{B_i \in S_{Bj}} B_i\Bigr)$$

has the minimum cost among all trees that realize the truth function
determined by $(A_1 + A_2 + \cdots + A_n) \cdot (B_1 + B_2 + \cdots + B_m)$.

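Similarly, for the case of $\sum_{j=1}^{m} b_j < \sum_{i=1}^{n} a_i$, let $S_{Aj}$, $1 \le j \le k$, be
the sets of leaves of k Huffman subtrees of $T_A$ such that their union is
$\{A_1, A_2, \ldots, A_n\}$, and

$$\sum_{A_i \in S_{Aj}} a_i \le \sum_{i=1}^{m} b_i, \qquad j = 1, 2, \ldots, k$$

but

$$\sum_{A_i \in S_{Aj} \cup S_{Aj'}} a_i > \sum_{i=1}^{m} b_i, \qquad j, j' = 1, 2, \ldots, k \text{ and } j \ne j'.$$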

We have

Corollary 3-2

When $\sum_{i=1}^{n} a_i > \sum_{j=1}^{m} b_j$, an optimal tree corresponding to the Boolean
expression

$$\sum_{j=1}^{k} (B_1 + B_2 + \cdots + B_m) \cdot \Bigl(\sum_{A_i \in S_{Aj}} A_i\Bigr)$$

has the minimum cost among all trees that realize the truth function
determined by $(A_1 + A_2 + \cdots + A_n) \cdot (B_1 + B_2 + \cdots + B_m)$.

An algorithm to determine an optimal tree in this case is

Algorithm B

1.  If $\sum_{i=1}^{n} a_i = \sum_{i=1}^{m} b_i$, $(A_1 + A_2 + \cdots + A_n) \cdot (B_1 + B_2 + \cdots + B_m)$ is an
optimal expression. An optimal tree is $T_u$ shown in Fig. 3-6a, where $T_A$ and
$T_B$ are Huffman trees of $A_1, A_2, \ldots, A_n$ and $B_1, B_2, \ldots, B_m$, respectively.

2.  If $\sum_{i=1}^{n} a_i < \sum_{i=1}^{m} b_i$,

    a.  Choose the Boolean expression

$$(A_1 + A_2 + \cdots + A_n) \cdot \Bigl(\sum_{B_i \in S_{B1}} B_i\Bigr) + (A_1 + A_2 + \cdots + A_n) \cdot \Bigl(\sum_{B_i \in S_{B2}} B_i\Bigr)$$

where $S_{B1}$ and $S_{B2}$ are the sets of leaves of the two Huffman subtrees of $T_B$.
The corresponding merge tree is $T_{dB}$ in Fig. 3-6c.

    b.  For each of the terms $(A_1 + A_2 + \cdots + A_n) \cdot (\sum_{B_i \in S_{Bj}} B_i)$, if
$\sum_{i=1}^{n} a_i < \sum_{B_i \in S_{Bj}} b_i$, distribute · with respect to the sum of the $B_i$'s such that
the $B_i$'s in each of the sum terms are elements of sets of leaves of smaller
Huffman subtrees.

    c.  The process in step 2b terminates either when
$\sum_{i=1}^{n} a_i \ge \sum_{B_i \in S_{Bj}} b_i$ or when we obtain terms of the form $(\sum_{i=1}^{n} A_i) \cdot B_j$.

3.  If $\sum_{i=1}^{n} a_i > \sum_{i=1}^{m} b_i$,

    a.  Choose the Boolean expression

$$(B_1 + B_2 + \cdots + B_m) \cdot \Bigl(\sum_{A_i \in S_{A1}} A_i\Bigr) + (B_1 + B_2 + \cdots + B_m) \cdot \Bigl(\sum_{A_i \in S_{A2}} A_i\Bigr)$$

where $S_{A1}$ and $S_{A2}$ are the sets of leaves of the two Huffman subtrees of $T_A$.
The corresponding merge tree is $T_{dA}$ in Fig. 3-6b.

    b.  For each of the terms $(B_1 + B_2 + \cdots + B_m) \cdot (\sum_{A_i \in S_{Aj}} A_i)$, if
$\sum_{i=1}^{m} b_i < \sum_{A_i \in S_{Aj}} a_i$, distribute · with respect to the sum of the $A_i$'s such that
the $A_i$'s in each of the sum terms are elements of sets of
leaves of smaller Huffman subtrees.

    c.  The process in step 3b terminates either when $\sum_{i=1}^{m} b_i \ge \sum_{A_i \in S_{Aj}} a_i$ or
when we obtain terms of the form $(\sum_{i=1}^{m} B_i) \cdot A_j$.
We illustrate Algorithm B by an example. Consider the expression
$(A_1 + A_2 + A_3 + A_4) \cdot (B_1 + B_2)$. The lengths of the corresponding lists are
$a_1 = 1$, $a_2 = 2$, $a_3 = 3$, $a_4 = 6$, $b_1 = 2$ and $b_2 = 2$. The Huffman trees for
the lists $A_1$, $A_2$, $A_3$ and $A_4$ and for the lists $B_1$ and $B_2$ are shown in Fig. 3-7a.
Since $a_1 + a_2 + a_3 + a_4 > b_1 + b_2$, the cost of the merge tree $T_1$ in Fig. 3-7b
corresponding to the Boolean expression

$$F_1 = (B_1 + B_2) \cdot (A_1 + A_2 + A_3) + (B_1 + B_2) \cdot A_4$$

is less than the cost of the tree $T((A_1 + A_2 + A_3 + A_4) \cdot (B_1 + B_2))$. Moreover, since
$a_1 + a_2 + a_3 > b_1 + b_2$, the cost of the tree $T_2$ in Fig. 3-7c is less than the cost
of $T_1$. Indeed, $T_2$ is an optimal tree and the Boolean expression corresponding
to it is

$$F_2 = (B_1 + B_2) \cdot (A_1 + A_2) + (B_1 + B_2) \cdot A_3 + (B_1 + B_2) \cdot A_4$$

($C(T((A_1 + A_2 + A_3 + A_4) \cdot (B_1 + B_2))) \ge 41$, $C(T_1) = 33$, $C(T_2) = 31$.)

Figure 3-7. (a) Huffman trees $T_A$ and $T_B$; (b) $T_1$; (c) $T_2$.
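The numbers in this example can be checked with a sketch of Algorithm B. As before, the node layout, the function names and the Huffman tie-breaking rule are illustrative assumptions, and overlaps are neglected. The side with the smaller total length is kept whole (its Huffman tree is built once), while the other side is split into Huffman subtrees until each piece is no longer than the whole side or is a single list.

```python
import heapq
from itertools import count


def huffman_tree(lengths):
    """Huffman tree whose nodes record weight, OR-merge cost, leaves, children."""
    tie = count()  # unique tie-breaker so heap entries never compare nodes
    heap = [(w, next(tie), {"weight": w, "cost": 0, "leaves": [i], "kids": None})
            for i, w in enumerate(lengths)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, n1 = heapq.heappop(heap)
        w2, _, n2 = heapq.heappop(heap)
        node = {"weight": w1 + w2,
                "cost": n1["cost"] + n2["cost"] + w1 + w2,
                "leaves": n1["leaves"] + n2["leaves"],
                "kids": (n1, n2)}
        heapq.heappush(heap, (w1 + w2, next(tie), node))
    return heap[0][2]


def split(node, limit):
    """Descend into Huffman subtrees until each piece weighs at most
    `limit` or is a single list (steps 2b/2c and 3b/3c)."""
    if node["kids"] is None or node["weight"] <= limit:
        return [node]
    return [piece for kid in node["kids"] for piece in split(kid, limit)]


def algorithm_b(a_lengths, b_lengths):
    """Return (split_side, groups, cost) for (A_1+...+A_n).(B_1+...+B_m)."""
    ta, tb = huffman_tree(a_lengths), huffman_tree(b_lengths)
    if ta["weight"] >= tb["weight"]:      # equality leaves both sides whole
        side, whole, parts = "A", tb, split(ta, tb["weight"])
    else:
        side, whole, parts = "B", ta, split(tb, ta["weight"])
    # OR merges inside each piece and inside the whole side, plus one AND
    # merge per piece (length of the piece + length of the whole side).
    cost = whole["cost"] + sum(p["cost"] for p in parts)
    cost += len(parts) * whole["weight"] + sum(p["weight"] for p in parts)
    return side, [p["leaves"] for p in parts], cost
```

Calling `algorithm_b([1, 2, 3, 6], [2, 2])` splits the A side into the groups {A_1, A_2}, {A_3} and {A_4} and reports a cost of 31, the value given for $C(T_2)$.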

Further Generalizations

To find an optimal merge tree when the Boolean expression
specified in the query is a product of sum terms

$$P_m = \prod_{j=1}^{m} (A_{j1} + A_{j2} + \cdots + A_{jn_j}) \tag{3-7}$$

and $A_{ji}$ are all distinct, let $C(m)$ denote the cost of the merge tree $T(P_m)$.
Suppose that we first complete all those merges corresponding to the Boolean
expression

$$P_{m-1} = \prod_{j=1}^{m-1} (A_{j1} + A_{j2} + \cdots + A_{jn_j})$$

Since the lengths of overlaps between lists are assumed to be negligibly small,
it follows from Theorem 3-1 that the cost

$$C(m) = \sum_{i=1}^{n_m} a_{mi} + C(m-1)$$

is minimum in this case and the corresponding Boolean expression is

$$P_m = A_{m1} \cdot P_{m-1} + A_{m2} \cdot P_{m-1} + \cdots + A_{mn_m} \cdot P_{m-1}$$

Similarly, we have

$$C(m) = \sum_{j=3}^{m} \sum_{i=1}^{n_j} a_{ji} + C(T((A_{11} + A_{12} + \cdots + A_{1n_1}) \cdot (A_{21} + A_{22} + \cdots + A_{2n_2})))$$

Suppose that the indices are chosen so that

$$C(T((A_{11}+A_{12}+\dots+A_{1n_1})\cdot(A_{21}+A_{22}+\dots+A_{2n_2}))) - \sum_{i=1}^{n_1} a_{1i} - \sum_{i=1}^{n_2} a_{2i}$$
$$\le C(T((A_{j1}+A_{j2}+\dots+A_{jn_j})\cdot(A_{k1}+A_{k2}+\dots+A_{kn_k}))) - \sum_{i=1}^{n_j} a_{ji} - \sum_{i=1}^{n_k} a_{ki}$$

for all $j, k = 3, 4, \dots, m$.


Corollary 3-3

If $\sum_{i=1}^{n_j} a_{ji}$ are equal for all $j = 1, 2, \dots, m$, the cost of the
optimal tree corresponding to the Boolean expression in (3-7) is

$$C_0(m) = \sum_{j=1}^{m}\sum_{i=1}^{n_j} a_{ji} + C(T(A_{11}+A_{12}+\dots+A_{1n_1})) + C(T(A_{21}+A_{22}+\dots+A_{2n_2}))$$

Furthermore, when $\sum_{i=1}^{n_j} a_{ji}$ are not all equal, the cost of the optimal tree
is given by

Corollary 3-4

$$C_0(m) = \sum_{j=3}^{m}\sum_{i=1}^{n_j} a_{ji} + C(T_0((A_{11}+A_{12}+\dots+A_{1n_1})\cdot(A_{21}+A_{22}+\dots+A_{2n_2}))) \qquad (3\text{-}8)$$

When the Boolean expression given in the query can be written as
a sum of product terms

$$S_m = \sum_{j=1}^{m} A_{j1}\cdot A_{j2}\cdots A_{jn_j}$$

with all $A_{ji}$ being distinct, we note that distribution of the $+$ operation
with respect to the $\cdot$ operation will lead to an increase in the total merge
time. Hence, the optimal cost in this case is

$$C = \sum_{j=1}^{m}\sum_{i=1}^{n_j} a_{ji}$$


Optimal  Merge  Trees  for  Boolean  Expressions  Containing 
Complements  of  Variables 

To generalize the results discussed above to Boolean expressions
containing complements of variables, we consider expressions of the form

$$(A_1+A_2+\dots+A_n)\cdot\bar{B} \qquad (3\text{-}9)$$

In this case, we have

Theorem 3-3

$(A_1+A_2+\dots+A_n)\cdot\bar{B}$ is an optimal expression among all expressions
equivalent to it. Hence an optimal merge tree is $T_u$ shown in Fig. 3-8
where the symbol $\cdot\neg$ is used to denote an AND NOT merge.

Proof: The cost of the merge tree corresponding to the Boolean
expression in (3-9) is

$$C_u = C(T_A) + \sum_{i=1}^{n} a_i + b$$

where $T_A$ is a Huffman tree for the lists $A_1, A_2, \dots, A_n$. Let $T_{A1}$ and
$T_{A2}$ be two subtrees of $T(A_1+A_2+\dots+A_n)$ with sets of leaves $S_{A1}$ and $S_{A2}$,
respectively. The cost of the merge tree corresponding to the Boolean
expression

$$\Bigl(\sum_{A_i\in S_{A1}} A_i\Bigr)\cdot\bar{B} + \Bigl(\sum_{A_i\in S_{A2}} A_i\Bigr)\cdot\bar{B}$$

is

$$C_d = C(T_{A1}) + C(T_{A2}) + \sum_{i=1}^{n} a_i + 2b + \sum_{i=1}^{n} a_i$$
$$\ge C(T_A) + \sum_{i=1}^{n} a_i + 2b$$
$$= C_u + b$$

[Figure 3-8: the merge tree $T_u$, in which $T_0(A_1+A_2+\dots+A_n)$ is AND NOT merged with $B$.]

Similarly, let $T_{Ai}$ for $i = 1, 2, \dots, m$ be $m$ subtrees of $T_A$ and $S_{Ai}$ be
the sets of their leaves. Then

$$C_{dm} = C(T_A) + mb + \sum_{i=1}^{n} a_i \ge C_u \qquad \blacksquare$$
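The inequality $C_d \ge C_u + b$ in the proof of Theorem 3-3 can be checked numerically. The sketch below assumes negligible overlaps, so an AND NOT merge of a list of length $r$ with $B$ costs $r + b$ and leaves a list of length about $r$; the list lengths and the names `cost_undistributed` and `cost_distributed` are hypothetical.

```python
import heapq


def huffman_merge(lengths):
    """Return (total OR-merge time, final length), assuming negligible overlaps."""
    heap = list(lengths)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        x = heapq.heappop(heap)
        y = heapq.heappop(heap)
        cost += x + y
        heapq.heappush(heap, x + y)
    return cost, heap[0]


def cost_undistributed(a, b):
    """C_u: OR-merge all A lists, then one AND NOT merge with B."""
    c_a, r = huffman_merge(a)
    return c_a + r + b


def cost_distributed(groups, b):
    """C_d: AND NOT each group with B, then OR-merge the results."""
    total = 0
    result_lengths = []
    for group in groups:
        c_g, r_g = huffman_merge(group)
        total += c_g + (r_g + b)      # AND NOT merge; result keeps length ~ r_g
        result_lengths.append(r_g)
    or_cost, _ = huffman_merge(result_lengths)
    return total + or_cost


a, b = [1, 2, 3, 6], 4
c_u = cost_undistributed(a, b)
c_d_huffman_split = cost_distributed([[1, 2, 3], [6]], b)  # subtrees of the Huffman tree
c_d_other_split = cost_distributed([[1, 6], [2, 3]], b)    # an arbitrary split
print(c_u, c_d_huffman_split, c_d_other_split)
```

With these lengths, splitting at the root of the Huffman tree gives exactly $C_u + b$, and the arbitrary split costs more.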

We further observe that the cost of the merge tree corresponding to
the Boolean expression

$$B\cdot\overline{A_1+A_2+\dots+A_n}$$

is equal to that of $B\cdot(A_1+A_2+\dots+A_n)$. Again, let $T_{A1}$ and $T_{A2}$ be two
Huffman subtrees of $T_0(A_1+A_2+\dots+A_n)$ with their sets of leaves being $S_{A1}$
and $S_{A2}$, respectively. The cost of the merge tree corresponding to the
Boolean expression

$$B\cdot\overline{\sum_{A_i\in S_{A1}} A_i}\cdot\overline{\sum_{A_i\in S_{A2}} A_i}$$

is equal to

$$C(T_{A1}) + C(T_{A2}) + \sum_{i=1}^{n} a_i + 2b,$$

the same as the cost of the tree corresponding to

$$B\cdot\Bigl(\sum_{A_i\in S_{A1}} A_i\Bigr) + B\cdot\Bigl(\sum_{A_i\in S_{A2}} A_i\Bigr)$$

† We note that in response to a query of the form $B\cdot\bar{F}_1\cdot\bar{F}_2$, the system
will parenthesize the expression as $(B\cdot\bar{F}_1)\cdot\bar{F}_2$, or rewrite it as
$B\cdot\overline{F_1+F_2}$.

Hence we can use the result in Theorem 3-1 and obtain

Corollary 3-5

When $b \ge \sum_{i=1}^{n} a_i$, $B\cdot\overline{A_1+A_2+A_3+\dots+A_n}$ is optimal among all Boolean
expressions equivalent to it.
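The comparison behind Corollary 3-5 can be sketched as follows. The assumptions are negligible overlaps and that the list produced by the first AND NOT merge still has length about $b$; the function name `and_not_costs` and the sample lengths are hypothetical.

```python
import heapq


def huffman_merge(lengths):
    """Return (total OR-merge time, final length), assuming negligible overlaps."""
    heap = list(lengths)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        x = heapq.heappop(heap)
        y = heapq.heappop(heap)
        cost += x + y
        heapq.heappush(heap, x + y)
    return cost, heap[0]


def and_not_costs(group1, group2, b):
    """Costs of B.not(G1+G2) computed two ways.

    undistributed: OR-merge G1 and G2, then one AND NOT merge with B.
    distributed:   (B AND NOT G1) AND NOT G2, two AND NOT merges.
    """
    c1, r1 = huffman_merge(group1)
    c2, r2 = huffman_merge(group2)
    undistributed = c1 + c2 + (r1 + r2) + (r1 + r2 + b)
    distributed = c1 + c2 + (b + r1) + (b + r2)
    return undistributed, distributed


short_b = and_not_costs([1, 2, 3], [6], 4)    # b < sum of a_i = 12
long_b = and_not_costs([1, 2, 3], [6], 20)    # b > sum of a_i = 12
print(short_b, long_b)
```

The difference between the two costs is $b - \sum a_i$, so the undistributed form is the cheaper one exactly when $b$ is at least the total length of the $A$ lists.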

Algorithm A can be modified to determine an optimal equivalent form
when $b < \sum_{i=1}^{n} a_i$.

Algorithm A1

1. If $b \ge \sum_{i=1}^{n} a_i$, $B\cdot\overline{A_1+A_2+\dots+A_n}$ is optimal among all Boolean expressions
equivalent to it and an optimal merge tree is $T_u$ as shown in Fig. 3-9a.
Otherwise, we have

2. a. $b < \sum_{i=1}^{n} a_i$. Choose the Boolean expression

$$B\cdot\overline{\sum_{A_i\in S_{A1}} A_i}\cdot\overline{\sum_{A_i\in S_{A2}} A_i}$$

and the corresponding tree is $T_d$ shown in Fig. 3-9b.

b. An optimal tree can be obtained by repeating steps 1 and/or 2a for
$B\cdot\overline{\sum_{A_i\in S_{A1}} A_i}$ and then for
$\bigl(B\cdot\overline{\sum_{A_i\in S_{A1}} A_i}\bigr)\cdot\overline{\sum_{A_i\in S_{A2}} A_i}$.

[Figure 3-9: (a) $T_u$; (b) $T_d$.]
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Similarly,  from  Theorem  3-2,  we  have 

Corollary 3-6

When $\sum_{i=1}^{m} b_i = \sum_{i=1}^{n} a_i$, an optimal form of the Boolean expression

$$(B_1+B_2+\dots+B_m)\cdot\overline{A_1+A_2+\dots+A_n}$$

is

$$(B_1+B_2+\dots+B_m)\cdot\overline{A_1+A_2+\dots+A_n}$$

Algorithm B1 determines an optimal merge tree when $\sum_{i=1}^{m} b_i \ne \sum_{i=1}^{n} a_i$.

Algorithm B1

1. If $\sum_{i=1}^{n} a_i = \sum_{i=1}^{m} b_i$, the optimal merge tree is $T_u$ in Fig. 3-10a where
$T_A$ and $T_B$ are Huffman trees of $A_1, A_2, \dots, A_n$ and $B_1, B_2, \dots, B_m$.
Otherwise, we have

2. If $\sum_{i=1}^{n} a_i > \sum_{i=1}^{m} b_i$,

a. Choose the expression

$$\Bigl(\sum_{A_i\in S_{A1}} A_i\Bigr)\cdot\overline{B_1+B_2+\dots+B_m} + \Bigl(\sum_{A_i\in S_{A2}} A_i\Bigr)\cdot\overline{B_1+B_2+\dots+B_m}$$

The corresponding merge tree is shown in Fig. 3-10b.

b. For each of the terms $\bigl(\sum_{A_i\in S_{Aj}} A_i\bigr)\cdot\overline{B_1+B_2+\dots+B_m}$, if
$\sum_{A_i\in S_{Aj}} a_i > \sum_{i=1}^{m} b_i$, rewrite the term as

$$\Bigl(\sum_{A_i\in S_{Aj1}} A_i\Bigr)\cdot\overline{B_1+B_2+\dots+B_m} + \Bigl(\sum_{A_i\in S_{Aj2}} A_i\Bigr)\cdot\overline{B_1+B_2+\dots+B_m}$$

where $S_{Aj1}$ and $S_{Aj2}$ are the sets of leaves of the two Huffman subtrees of $T_{Aj}$.

c. Repeat Step 2b until either $\sum_{A_i\in S_{Aj}} a_i \le \sum_{i=1}^{m} b_i$ or the term becomes
$A_j\cdot\overline{B_1+B_2+\dots+B_m}$.

3. If $\sum_{i=1}^{n} a_i < \sum_{i=1}^{m} b_i$,

a. Choose the Boolean expression

$$(A_1+A_2+\dots+A_n)\cdot\overline{\Bigl(\sum_{B_i\in S_{B1}} B_i\Bigr)}\cdot\overline{\Bigl(\sum_{B_i\in S_{B2}} B_i\Bigr)}$$

and the corresponding merge tree is $T_{dB}$ as shown in Fig. 3-10c.

b. For each of the terms $\overline{\sum_{B_i\in S_{Bj}} B_i}$, if $\sum_{i=1}^{n} a_i < \sum_{B_i\in S_{Bj}} b_i$, rewrite the
term as

$$\overline{\Bigl(\sum_{B_i\in S_{Bj1}} B_i\Bigr)}\cdot\overline{\Bigl(\sum_{B_i\in S_{Bj2}} B_i\Bigr)}$$

where $S_{Bj1}$ and $S_{Bj2}$ are sets of leaves of the Huffman subtrees of $T_{Bj}$.

c. Repeat Step 3b until either $\sum_{i=1}^{n} a_i \ge \sum_{B_i\in S_{Bj}} b_i$ or when $S_{Bj}$ contains
one single term.

[Figure 3-10: panels (a), (b) and (c), showing the merge trees referred to in steps 1, 2a and 3a.]

Furthermore,  we  have 

Corollary  3-7 

The  cost  of  an  optimal  merge  tree  corresponding  to  the  expression 

m  I 

«±  (VV-'-^iJ  •  k!x  (Bii+Bi2+-+%^ 

is 

C  (m)  +2   2  b 
U      k=l  i=l  J6± 

where  C  (m)  is  the  cost  for  merging  the  lists  A. .  in  Eq.  (3-8). 

>J  J- J 
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IV.   Bounds  on  Sub  optimal  Parsing  Algorithms 

When the lengths of overlaps cannot be neglected in the computation of
merge times, the algorithms described in Section III are no longer optimal.
In this section, we derive bounds on the performance of these algorithms in
terms of the maximum overlap between lists. Again, let $\sigma(A,B)$ denote the
length of overlap between lists $A$ and $B$. The length of the resultant list
obtained by OR merging lists $A$ and $B$ is equal to $a+b-\sigma(A,B)$. The lengths
of the resultant lists are $\sigma(A,B)$ and $a-\sigma(A,B)$ when the lists $A$ and $B$ are
AND merged and AND NOT merged, respectively.
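The three resultant lengths can be checked with a small sketch. It assumes, as the report does elsewhere, that a list is a sorted list of distinct elements; the function names and the sample document numbers are illustrative only.

```python
def merge(a, b, keep):
    """Single pass over two sorted lists of distinct items.

    keep(in_a, in_b) decides whether an item is written to the output.
    """
    out = []
    i = j = 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            item, in_a, in_b = a[i], True, False
            i += 1
        elif i == len(a) or b[j] < a[i]:
            item, in_a, in_b = b[j], False, True
            j += 1
        else:  # same item in both lists
            item, in_a, in_b = a[i], True, True
            i += 1
            j += 1
        if keep(in_a, in_b):
            out.append(item)
    return out


def or_merge(a, b):
    return merge(a, b, lambda in_a, in_b: in_a or in_b)


def and_merge(a, b):
    return merge(a, b, lambda in_a, in_b: in_a and in_b)


def and_not_merge(a, b):
    return merge(a, b, lambda in_a, in_b: in_a and not in_b)


A = [1, 2, 3, 5, 8]
B = [2, 3, 4, 8, 9, 10]
sigma = len(and_merge(A, B))              # length of overlap
print(len(or_merge(A, B)), sigma, len(and_not_merge(A, B)))
```

Each merge reads both input lists once, which is why the merge time is taken to be the sum of the two input lengths regardless of the operation.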

Consider a set of lists $S = \{A_1, A_2, \dots, A_n\}$. We say that the maximum
length of overlap between them is $\hat\sigma$ if $\sigma(A_i, A_j) \le \hat\sigma$ for all $A_i$ and $A_j$ in $S$.
Moreover, let $S_1$ and $S_2$ be any two disjoint subsets of $S$, and $R_1$ and $R_2$ be
the two lists obtained by OR merging lists in $S_1$ and $S_2$, respectively. Then,
$\sigma(R_1, R_2) \le \hat\sigma$. Hence, $\hat\sigma$ is an absolute measure of the maximum length of overlap.
It is a meaningful measure when the lists in $S$ are of comparable lengths and
their intersections are relatively small compared to their lengths.

In practice, the lengths of the inverted lists may differ by several
orders of magnitude.† The length of overlap between any pair of lists is often
measured in terms of a fraction of the length of the shorter list. Let
$\phi(A_i, A_j)$ denote the fraction such that

$$\sigma(A_i, A_j) = \phi(A_i, A_j)\,\min[a_i, a_j]$$

for any $A_i$, $A_j$ in $S = \{A_1, A_2, \dots, A_n\}$. Let $\hat\phi$ denote the maximum overlap for
the set $S$. That is, $\sigma(A_i, A_j) \le \hat\phi\,\min[a_i, a_j]$ and $\sigma(R_1, R_2) \le \hat\phi\,\min[r_1, r_2]$
where $r_1$ and $r_2$ are the lengths of $R_1$ and $R_2$, respectively. With slight
abuse of the term, we also call $\hat\phi$ the maximum length of overlap.

† For example, from MEDLAR Master Mesh, we found that the length of the list
corresponding to the index term HUMAN is 493,599 while that corresponding
to LUROVIN is only 4.
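As a small illustration of the two measures, the sketch below computes the largest pairwise overlap and the largest pairwise fraction for a set of lists. It only inspects pairs of the original lists, whereas the definitions above also constrain lists obtained by OR merging disjoint subsets; the function name `max_overlaps` and the sample lists are hypothetical.

```python
from itertools import combinations


def overlap(a, b):
    """sigma(A, B): number of elements common to both lists."""
    return len(set(a) & set(b))


def max_overlaps(lists):
    """Return (largest pairwise sigma, largest pairwise phi)."""
    sigma_hat = 0
    phi_hat = 0.0
    for a, b in combinations(lists, 2):
        s = overlap(a, b)
        sigma_hat = max(sigma_hat, s)
        phi_hat = max(phi_hat, s / min(len(a), len(b)))
    return sigma_hat, phi_hat


lists = [list(range(10)), [8, 9, 10, 11], [20, 21, 22]]
sigma_hat, phi_hat = max_overlaps(lists)
print(sigma_hat, phi_hat)
```

Here the first two lists share two elements, which is half of the shorter list, so the absolute measure is 2 while the fractional measure is 0.5.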

Again let $C(T)$ denote the cost of a merge tree $T(A_1+A_2+\dots+A_n)$.
Since $C(T)$ is the total merge time of $A_1, A_2, \dots, A_n$ when their merge order
is specified by $T$, $C(T)$ depends on the lengths of $A_1, A_2, \dots, A_n$ as well as the
overlaps between them. Let $P(T)$ denote the weighted path length of the
tree $T$. As discussed in Section III, $P(T)$ is the cost of the tree $T$ when
the corresponding Boolean expression is $A_1+A_2+\dots+A_n$ and the lengths of overlap
between the lists are zero.

Bounds on Cost of Huffman Tree

Let $T_0(A_1+A_2+\dots+A_n)$ be an optimal merge tree for the lists
$A_1, A_2, \dots, A_n$ corresponding to the Boolean expression $A_1+A_2+\dots+A_n$. As
demonstrated by the example in Fig. 4-1, $T_0$ cannot be obtained using
Huffman's procedure in general. Let $T_H$ denote the Huffman tree for
$A_1, A_2, \dots, A_n$. We now bound the cost of the Huffman tree $T_H$.

Lemma 4-1

Let $S = \{A_1, A_2, \dots, A_k\}$ be a set of lists and $R_k$ be the list
obtained by OR merging all the $A_i$ in $S$. The length of the list $R_k$, $r_k$,
is such that

$$r_k \ge \sum_{i=1}^{k} a_i - (k-1)\hat\sigma$$

where $\hat\sigma$ is the maximum overlap between the lists in $S$. Moreover, the bound
is tight.


[Figure 4-1: (a) $T_H$, Huffman tree, $C(T_H) = 2(2+3) + 4 = 14$; (b) $T_0$, optimal tree,
$C(T_0) = 3 + 4 + 4 + 2 = 13$. Here $a_1 = 4$, $a_2 = 3$, $a_3 = 2$, $\sigma(A_1,A_2) = 3$,
$\sigma(A_1,A_3) = 0$, $\sigma(A_2,A_3) = 0$.]
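The two costs in Figure 4-1 can be reproduced by merging concrete sets. The element values below are made up so that the list lengths and overlaps match the figure; a merge tree is written as nested pairs, and the function name `or_merge_cost` is illustrative.

```python
def or_merge_cost(tree):
    """Return (merged set, total merge time) for a tree of nested pairs.

    A leaf is a list of distinct elements; merging two lists costs the
    sum of their actual lengths, so overlaps shorten later merges.
    """
    if not isinstance(tree, tuple):
        return set(tree), 0
    left, left_cost = or_merge_cost(tree[0])
    right, right_cost = or_merge_cost(tree[1])
    return left | right, left_cost + right_cost + len(left) + len(right)


A1 = [1, 2, 3, 4]      # a1 = 4
A2 = [1, 2, 3]         # a2 = 3, shares 3 elements with A1
A3 = [10, 11]          # a3 = 2, disjoint from both

_, cost_huffman = or_merge_cost(((A3, A2), A1))   # two shortest lists first
_, cost_optimal = or_merge_cost(((A1, A2), A3))   # overlapping lists first
print(cost_huffman, cost_optimal)                 # 14 13
```

Merging the two overlapping lists first produces an intermediate list of length 4 rather than 5, which is where the saving comes from.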


Proof: Let us consider any list $A_i$ in the set $S$. Without loss of
generality, suppose that $A_i \cap A_{i_1}, A_i \cap A_{i_2}, \dots, A_i \cap A_{i_k}$ are nonempty
(where $\cap$ denotes set intersection).† By definition of $\hat\sigma$, the total number of
elements in $A_i \cap A_{i_1}, A_i \cap A_{i_2}, \dots, A_i \cap A_{i_k}$ is at most $\hat\sigma$.
Let

$$I(A_1) = \emptyset$$

where $\emptyset$ is the null set and

$$I(A_i) = A_i \cap (A_1 \cup A_2 \cup \dots \cup A_{i-1}), \qquad i = 2, 3, \dots, k$$

We note that the lists

$$A_1' = A_1, \quad A_2' = A_2 - I(A_2), \quad \dots, \quad A_k' = A_k - I(A_k)$$

are disjoint. Moreover, their lengths $a_1', a_2', \dots, a_k'$ are such that

$$a_1' = a_1, \quad a_2' \ge a_2 - \hat\sigma, \quad \dots, \quad a_k' \ge a_k - \hat\sigma$$

Since the list corresponding to $A_1' + A_2' + \dots + A_k'$ is $R_k$, we have

$$r_k = \sum_{i=1}^{k} a_i' \ge \sum_{i=1}^{k} a_i - (k-1)\hat\sigma$$

That the bound is tight can be demonstrated by the example:

$$A_1 = (\alpha,\beta,\gamma,x_1,\dots,x_m), \quad A_2 = (\alpha,\beta,\gamma,y_1,\dots,y_n), \quad \dots, \quad A_k = (\alpha,\beta,\gamma,z_1,\dots,z_\ell)$$

where $x_i$, $y_i$, $z_i$ are all distinct, $\hat\sigma = 3$ and the list $R_k$ is
$(\alpha,\beta,\gamma,x_1,\dots,x_m,y_1,\dots,y_n,\dots,z_1,\dots,z_\ell)$ with length $\sum_{i=1}^{k} a_i - 3(k-1)$. ■

† We point out here that throughout our discussion, by a list, we mean a
sorted list of distinct elements. Hence, we may also regard a list as a
set whenever the order in which the elements appear in it is irrelevant.
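The tight example for Lemma 4-1 can be checked directly. In this sketch every list shares the same three elements and is otherwise disjoint from the others; the helper name `tight_example` and the list sizes are hypothetical.

```python
def tight_example(extra_sizes):
    """Build lists that share three common elements and nothing else."""
    common = {"alpha", "beta", "gamma"}
    return [common | {(i, j) for j in range(size)}
            for i, size in enumerate(extra_sizes)]


lists = tight_example([4, 2, 5])           # lengths 7, 5 and 8
sigma_hat = 3                              # every pair overlaps in 3 elements
k = len(lists)

r_k = len(set().union(*lists))             # length of the OR-merged list
bound = sum(len(s) for s in lists) - (k - 1) * sigma_hat
print(r_k, bound)
```

Both numbers equal 14 here, so the lower bound of the lemma is met with equality.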
In terms of $\hat\sigma$, we have the following bound on the weighted path
length $P(T_H)$ of a Huffman tree for a set of lists $S = \{A_1, A_2, \dots, A_n\}$.

Theorem 4-1

$$P(T_H) \le C(T) + \frac{1}{2}(n^2+n-1)\,\frac{n-1}{n}\,\hat\sigma \qquad (4\text{-}1)$$

where $\hat\sigma$ is the maximum overlap for lists $A_1, A_2, \dots, A_n$ and $C(T)$ is the cost
of any tree $T$ for these lists corresponding to the Boolean expression
$A_1+A_2+\dots+A_n$.

Proof: Let $T$ be an arbitrary tree for the set of lists $\{A_1, A_2, \dots, A_n\}$
corresponding to the expression $A_1+A_2+\dots+A_n$. Suppose that in $T$ the leaf
$A_i$ is at level $\ell_i$, $i = 1, 2, \dots, n$. Let $T'$ be a tree obtained from $T$ when
the weight of $A_i$ is replaced by $a_i - \frac{n-1}{n}\hat\sigma$. We claim that

$$P(T') \le C(T) \qquad (4\text{-}2)$$

That is,

$$\sum_{i=1}^{n} \ell_i\Bigl(a_i - \frac{n-1}{n}\hat\sigma\Bigr) \le C(T)$$

Since a Huffman tree has the minimum weighted path length and
$\sum_{i=1}^{n} \ell_i \le \frac{1}{2}(n^2+n-1)$, we have

$$P(T_H) - \frac{1}{2}(n^2+n-1)\cdot\frac{n-1}{n}\,\hat\sigma \le C(T)$$

and hence Eq. (4-1).


To show that the inequality in (4-2) is valid, let $A$ be an arbitrary
list in the set $\{A_1, A_2, \dots, A_n\}$. Suppose that in the tree $T$, $A$ is at level $\ell$
as shown in Fig. 4-2, where $T_1, T_2, \dots, T_{\ell-1}, T_\ell$ are subtrees of $T$. Let
$S_1, S_2, \dots, S_\ell$ be the sets of leaves and $R_1, R_2, \dots, R_\ell$ be the lists corresponding
to the roots of $T_1, T_2, \dots, T_\ell$, respectively. We note that the weight of the
node $j$ in $T'$, $t_j'$, is

$$t_j' = a + \sum_{k=j+1}^{\ell}\sum_{A_i\in S_k} a_i - \frac{n-1}{n}\,\hat\sigma\Bigl(\sum_{k=j+1}^{\ell}|S_k| + 1\Bigr) \qquad (4\text{-}3)$$

On the other hand, from Lemma 4-1, the length of the list $R_{j+1}$, $r_{j+1}$, is
such that

$$r_{j+1} \ge \sum_{A_i\in S_{j+1}} a_i - (|S_{j+1}| - 1)\,\hat\sigma$$

and the length of the list corresponding to the node $(j+1)$ is lower bounded by

$$a + \sum_{k=j+2}^{\ell}\Bigl(\sum_{A_i\in S_k} a_i - \hat\sigma\,|S_k|\Bigr)$$

Hence the merge time required to generate the list corresponding to the node $j$
in the tree $T$, $t_j$, is such that

$$t_j \ge a + \sum_{k=j+1}^{\ell}\sum_{A_i\in S_k} a_i - \Bigl(\sum_{k=j+1}^{\ell}|S_k| - 1\Bigr)\hat\sigma$$

Since

$$\frac{n-1}{n}\Bigl(\sum_{k=j+1}^{\ell}|S_k| + 1\Bigr) \ge \Bigl(\sum_{k=j+1}^{\ell}|S_k| - 1\Bigr)\;\dagger$$

for any $n \ge 1$, for all $j = 0, 1, \dots, \ell-1$, $t_j' \le t_j$ and the inequality in
(4-2) follows. ■

† The inequality $\frac{n-1}{n}(x+1) \ge x - 1$ is equivalent to $2n \ge x + 1$.
Since we have $x \le n$ and $n \ge 1$, the inequality is always valid.

[Figure 4-2: the tree $T$, with the list $A$ at level $\ell$ and the subtrees $T_1, T_2, \dots, T_\ell$.]

It follows from Theorem 4-1 that the cost of a Huffman tree is
upper bounded by

Corollary 4-1

$$C(T_H) \le C(T_0) + \frac{1}{2}(n^2+n-1)\,\frac{n-1}{n}\,\hat\sigma \qquad (4\text{-}4)$$

where the tree $T_0$ is optimal.

Although the bound in (4-4) is not tight, it does indicate that in
most cases of practical interest, the cost of the Huffman tree does not
differ substantially from that of an optimal tree for processing a query of
the form $A_1+A_2+\dots+A_n$. For example, when the lengths of the lists are
approximately equal,

C(TQ)  ~  £(l  -  j)   ne^  n 


where  £   is  the  approximate  list  length.   Since  p(n  +n-l)  a  is 

12^ 
approximately  equal  to  —  n  a  for  large  n,  its  value  becomes  comparable  to 


C(T  )  only  when' 


2  %2  n 


In       £    /    Ov 
^  v1-    a  / 


a 

0  0 

(For n = 64, the threshold is approximately 0.20; for n = 8, it is approximately 0.43. That is, only for very large
overlaps.)

Similarly, we bound the cost of a Huffman tree in terms of the
maximum overlap $\hat\phi$.

Lemma 4-2

$$r_k \ge \Bigl(1 - \frac{\hat\phi}{k}\Bigr)\sum_{i=1}^{k} a_i$$

Again, $r_k$ is the length of the list corresponding to the expression
$A_1+A_2+\dots+A_k$.

Proof: Let $A_k$ be the shortest list among $A_1, A_2, \dots, A_k$ and $R_{k-1}$ be
the list corresponding to the expression $A_1+A_2+\dots+A_{k-1}$. The length of
$R_{k-1}$, $r_{k-1}$, is clearly larger than the length of any particular list
$A_1, A_2, \dots,$ or $A_{k-1}$. Hence

$$r_k \ge \sum_{i=1}^{k} a_i - \hat\phi\,a_k \ge \sum_{i=1}^{k} a_i\Bigl(1 - \frac{\hat\phi}{k}\Bigr)$$

Hence, we obtain the following bound for the cost of a Huffman tree for
the lists $A_1, A_2, \dots, A_n$.

Theorem 4-2

$$C(T_H) \le P(T_H) \le \frac{1}{1 - \frac{1}{2}\hat\phi}\,C(T) \qquad (4\text{-}5)$$

where $T$ is any arbitrary tree for $A_1+A_2+\dots+A_n$.

Proof: The proof is similar to that of Theorem 4-1. Suppose that $T'$ is a
tree obtained from $T$ when the weight of $A_i$ is replaced by $a_i\bigl(1 - \frac{1}{2}\hat\phi\bigr)$.
We claim that

$$P(T') \le C(T) \qquad (4\text{-}6)$$

and hence, the inequalities in (4-5) follow.

To show that the inequality (4-6) is valid, we again look at the
tree in Fig. 4-2. The weight of the node $j$ in the tree $T'$, $t_j'$, is such that


I 


t!   >  (a  +       Z 

3  N+1AA 


ai)(l-|*) 


On  the  other  hand,  the  length  of  the  list  R.  -. ,    r.  _,,  is  such  that 

0+1'   j+1' 


ri+l  -  L        ai  ^  "  Tq 1 

a+1   A.eS.  n  X     |Sd+l' 


and  the  length  of  the  list  corresponding  to  the  node  (j+1)  is  lower  bounded  by 

*  V 

a  +       Z  Z     a.      1 

k=j+2  A.eSn    V  \ 

J  l     k    /   \       1+       Z        S 


£ 


k=o+2 


k1 


Hence,  the  merge  time  required  to  generate  the  list  corresponding  to  the  node, 


t.»  is  such  that 

(  l 

t .  >  [a  +   I     Z   a.  1  min 


1  - 


0+1'  I 


1  - 


i  +  z  s 


k=o'+2 


k1 


The inequality in (4-6) follows immediately from this expression.

We again note that the bound in (4-5) is not tight. However,
we can conclude from it that for most cases of practical interest
($\hat\phi < 0.1$) the total merge time is quite close to the minimum when the merge
order is specified by a Huffman tree.


Performance of Suboptimal Algorithms

Clearly, when the length of overlap is not negligible, the algorithms
described in Section III are no longer optimal. However, we have the
following special cases:

Theorem 4-3

When $\sum_{i=1}^{n} a_i \le b$, $(A_1+A_2+\dots+A_n)\cdot B$ is optimal among all Boolean
expressions equivalent to it. Hence, an optimal merge tree is $T_u$ as
shown in Fig. 4-3a when $T$ is an optimal merge tree for $(A_1+A_2+\dots+A_n)$.

Proof: Let $C_u(T)$ denote the cost of the tree $T_u$ in Fig. 4-3a. Suppose
that $T(A_1+A_2+\dots+A_n)$ is an arbitrary tree for the lists $A_1, A_2, \dots, A_n$.
Let $R$ be the list corresponding to the Boolean expression $A_1+A_2+\dots+A_n$ and
$r$ be its length. Clearly,

$$C_u(T) = C(T) + r + b$$

We need to show that $C_u(T)$ is less than or equal to the cost of any tree $T_{m+1}$ shown in
Fig. 4-3b. Let $R_1, R_2, \dots, R_{m-1}, R_{m1}, R_{m2}$ be the lists corresponding to
the roots of $T_1, T_2, \dots, T_{m-1}, T_{m1}, T_{m2}$, which are the $m+1$ subtrees of
$T(A_1+A_2+\dots+A_n)$ obtained by removing the roots of larger subtrees of $T$.
Let $r_1, r_2, \dots, r_{m-1}, r_{m1}, r_{m2}$ be the lengths of these lists, respectively.
Furthermore, let $D_1, D_2, \dots, D_{m-1}, D_{m1}, D_{m2}$ be the lists obtained by AND
merging $R_1, R_2, \dots, R_{m-1}, R_{m1}, R_{m2}$, respectively, with the list $B$.
The cost of the tree $T_{m+1}$, $C_{m+1}$, is

$$C_{m+1} = \sum_{i=1}^{m-1}[C(T_i) + r_i] + C(T_{m1}) + C(T_{m2}) + r_{m1} + r_{m2} + (m+1)\,b + C(T(D_1+D_2+\dots+D_{m-1}+D_{m1}+D_{m2}))$$

[Figure 4-3: (a) $T_u$; (b) $T_{m+1}$.]

Let $T_m$ be a subtree of $T$ obtained by OR merging $R_{m1}$ and $R_{m2}$ together. We have

$$C_m = \sum_{i=1}^{m}[C(T_i) + r_i] + mb + C(T(D_1+D_2+\dots+D_m))$$

where $R_m$ is the root of $T_m$ and $D_m$ is the list obtained by AND merging $R_m$ with $B$.
Since

$$C(T_m) = C(T_{m1}) + C(T_{m2}) + r_{m1} + r_{m2}$$

$$C_{m+1} - C_m = b - (r_{m1} + r_{m2} - r_m) + C(T(D_1+D_2+\dots+D_{m-1}+D_{m1}+D_{m2})) - C(T(D_1+D_2+\dots+D_m))$$

That this difference is larger than or equal to zero follows from

$$r_{m1} + r_{m2} - r_m \ge 0$$

$$b \ge \sum_{i=1}^{n} a_i \ge (r_{m1} + r_{m2} - r_m)$$

and that $D_m$ is a list obtained by OR merging the lists $D_{m1}$ and $D_{m2}$. Hence,
$C_m$ is a monotone nondecreasing function of $m$. In other words, for any
tree $T_m$, there is a tree $T_u$ whose cost is no greater than $C_m$. Hence, we have the
statement of the theorem.
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V.   Summary 

The results in Section III allow us to find an optimal form of
any nested Boolean expression in which (i) all variables are distinct, and
(ii) the complement of any variable, $B$, (or any expression) can appear
only in the form

$$A_1\cdot A_2\cdots A_n\cdot\bar{B}$$

with at least one of the $A_i$'s not complemented. We already noted that the
restriction (ii) leads to no real loss of generality since it indeed is a
restriction imposed on the form of the query in most inverted list document
retrieval systems.

The problem of finding the optimal forms of Boolean expressions in
which not all variables are distinct is more difficult than that of finding
the minimum gate realization of the Boolean expression (when the fan-in of
gates is 2). Our problem is further complicated by the fact that if one
of the terms $A+B$, $A\cdot B$, $A\cdot\bar{B}$, $\bar{A}\cdot B$ is generated by merging lists $A$ and $B$, the
other three terms are also obtained at no additional cost.

When the lengths of overlaps between lists cannot be neglected in
computing the total merge time, the Boolean expressions determined by the
algorithms described in Section III are no longer optimal. The performance
of these algorithms can be bounded in terms of the maximum overlap between
the lists as done in Section IV.


References

[1] Hsiao, D. and Prywes, N. S., "A System to Manage an Information System,"
in Proc. of the FID/IFIP Joint Conference on Mechanized Information
Storage, Retrieval and Dissemination, Rome, Italy, 1967.

[2] Hsiao, D. and Harary, F., "A Formal System for Information Retrieval
from Files," Comm. ACM, Vol. 13, No. 2, February, 1970.

[3] Martin, L. D., "A Model for File Structure Determination for Large
On-line Data Files," in Proc. of the FILE 68 International Seminar on
File Organization, Copenhagen, 1968.

[4] Prywes, N. S., "Man-Computer Problem Solving with Multilists,"
Proc. IEEE, Vol. 54, No. 12, December, 1966.

[5] Cardenas, A. F., "Analysis and Performance of Inverted Data Base
Structures," Comm. ACM, Vol. 18, No. 5, May, 1975.

[6] Lowe, T. C., "The Influence of Data Base Characteristics and Usage
on Direct Access File Organization," J. ACM, Vol. 15, No. 4, October, 1968.

[7] Liu, Jane W. S., "Probabilistic Models of Inverted File Document
Retrieval Systems," Technical Report, UIUCDCS-R-75-7142, University of
Illinois, Department of Computer Science.

[8] Liu, Jane W. S., "Algorithms for Parsing Search Queries in Inverted
File Document Retrieval Systems," Technical Report UIUCDCS-R-75-718,
University of Illinois, Department of Computer Science, Urbana,
Illinois, September, 1975.

[9] Knuth, D. E., The Art of Computer Programming, Fundamental Algorithms,
Vol. 1, pp. 402-405, 1968.


Bibliographic Data Sheet

Report No.: UIUCDCS-R-75-718
Title and Subtitle: Algorithms for Parsing Search Queries in Inverted
File Document Retrieval Systems
Report Date: November, 1975
Author(s): Jane W. S. Liu
Performing Organization Name and Address: Department of Computer Science,
University of Illinois, Urbana, Illinois
Contract/Grant No.: NSF DCR 73-07980; NSF DCR 72-03740 A01
Sponsoring Organization Name and Address: National Science Foundation,
Washington, DC
Key Words and Document Analysis, Descriptors: inverted file; document
retrieval system; merge algorithm; parsing Boolean query
Security Class (This Report): UNCLASSIFIED
Security Class (This Page): UNCLASSIFIED


